feat(celery): Add task and worker lifecycle metrics#4439
Open
diogosilva30 wants to merge 3 commits intoopen-telemetry:mainfrom
Open
feat(celery): Add task and worker lifecycle metrics#4439diogosilva30 wants to merge 3 commits intoopen-telemetry:mainfrom
diogosilva30 wants to merge 3 commits intoopen-telemetry:mainfrom
Conversation
45f5df3 to
94a36e1
Compare
16069a0 to
c6ae222
Compare
Introduce Flower-compatible metrics for Celery task and worker events: - flower.events.total: counter for task-sent, task-received, task-started, task-succeeded, task-failed, task-retried, task-revoked - flower.task.runtime.seconds: histogram of task execution time - flower.worker.number.of.currently.executing.tasks: gauge of in-flight tasks per worker - flower.worker.online: gauge tracking worker online/offline status CeleryInstrumentor (child process) handles task tracing and task-level metrics. CeleryWorkerInstrumentor (main process) handles worker lifecycle signals (worker_ready, worker_shutdown) and main-process events (task_received, task_revoked). Prefetch-time metrics (task_prefetch_time_seconds, worker_prefetched_tasks) are intentionally omitted — Celery splits task_received and task_prerun across processes in prefork mode, making it impossible to compute the delta without external event consumption.
c6ae222 to
a136151
Compare
Contributor
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Implements a subset of the Prometheus metrics exposed by Celery Flower directly in the Celery instrumentation, so users who only need basic task/worker observability no longer need to run Flower as a separate service.
Fixes #3458
Type of change
Changes
Task-level metrics (in
CeleryInstrumentor)Metric tracking wired into the existing
prerun/postrun/failure/retryhandlers:flower_events_totalflower.events.totalflower_task_runtime_secondsflower.task.runtime.secondsflower_worker_number_of_currently_executing_tasksflower.worker.number.of.currently.executing.tasksWorker lifecycle metrics (new
CeleryWorkerInstrumentor)A separate instrumentor that hooks into
worker_ready/worker_shutdown/task_received/task_revokedsignals in the main worker process:flower_events_totalflower.events.totalflower_worker_onlineflower.worker.onlineRegistered as a new
celery_workerentry point inpyproject.toml.Omitted metrics
flower_task_prefetch_time_secondsandflower_worker_prefetched_tasksare not implemented. Inpool=preforkmode (the default),task_receivedfires in the main process whiletask_prerunfires in a child process — there is no shared state to compute the time delta between them. Flower can do this because it consumes Celery events externally in a single process; replicating that inside the worker is not feasible without producer-side instrumentation or protocol changes.Bug fix — memory leak in
task_id_to_start_timePreviously,
task_id_to_start_timewas a class-level dict that was never cleaned up after task completion, growing unbounded over time. It is now instance-scoped and entries are removed in_trace_postrun.Housekeeping
__init__.pyandutils.pyCeleryGetterproperly (Getter[Request])detach_context/retrieve_context/retrieve_task_id_from_messageinutils.pydictmetrics store with typed_CeleryTaskMetrics/_CeleryWorkerMetricsdataclassesHow Has This Been Tested?
test_metrics.py, ~1100 new lines)worker_ready/worker_shutdownsignalsTrying it out
A self-contained example project (Django + Celery + Docker Compose + Grafana LGTM) is available at
celery-metrics-example(separate branch on the fork, not part of this PR).Quick start:
Then open Grafana at http://localhost:3000 (
admin/admin) and explore metrics prefixed withflower.*.Does This PR Require a Core Repo Change?
Checklist: